Today, I'd like to start generating posts for our mathematics center as part of a marketing effort. In a past post, I described how to download and manage the data files. My goal is to "operationalize" that report by regularly generating the corresponding graphs along with a stub blog post; that is, we need to make sure the script is robust enough for our environment and test it extensively.
Our regular reporting includes metrics such as the busiest times of day, the busiest day of the week, and a breakdown of student activities, and I figure there's no reason we shouldn't at least share the busy hours with students for planning purposes.
I currently have a script running on our university server that generates detailed report information in a CSV format; we'll need to extract only the data we're interested in, process it, and create the graphs.
I have some other plans for reports and algorithms to run on this data, so let's leverage some existing packages to do this work. I'm going to use Julia as my implementation language this time around since it's easy to get up and running in a CI setting, which will help generate these pages automatically in the future. There are other advantages as well, though nearly all of this would work just as well in Python; see this post for a comparison with similar data.
CSV.jl will provide basic data import functionality, and Plots.jl (most likely with the default backend) will be used for plotting.
using Pkg
Pkg.activate(@__DIR__)
Pkg.instantiate()
Pkg.precompile()
To prototype the process, we'll work with a small amount of data:
dataFileDir = joinpath(@__DIR__, "MAC-swipes")
dataFiles = filter(x->endswith(x, ".csv"), readdir(dataFileDir))
using Dates
# Define the date format for attendance data:
const fileDF = dateformat"Y-m-d H:M:S"
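As a quick sanity check, the format string can be exercised on a hand-written timestamp (the value below is made up for illustration; only the pattern matters):

```julia
using Dates

# Same pattern as fileDF; the timestamp is an illustrative value
t = DateTime("2024-02-05 14:30:00", dateformat"Y-m-d H:M:S")
Dates.hour(t)       # -> 14
Dates.dayofweek(t)  # -> 1 (Monday)
```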
using CSV
# Note: The first row is noise (output from the generating script)
dataFile = joinpath(dataFileDir, dataFiles[1])
headerRow = startswith(readline(dataFile), "cmdOut") ? 2 : 1
f = CSV.File(dataFile, header=headerRow, normalizenames=true, dateformat=fileDF, types=Dict(8 => Union{Missing, DateTime}, 9 => Union{Missing, DateTime}))
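The header-detection trick is worth isolating: we peek at the file's first line, and if it starts with the generating script's `cmdOut` noise, we tell `CSV.File` the real header is on line 2. A minimal sketch against a throwaway file (the contents are hypothetical, mimicking the real layout) shows the logic on its own:

```julia
# Hypothetical first two lines of a swipe file: a noise line from the
# generating script, then the real CSV header
headerRow = mktemp() do path, io
    write(io, "cmdOut: report generated OK\nFIT_ID,Time_In,Time_Out\n")
    close(io)
    startswith(readline(path), "cmdOut") ? 2 : 1
end
println("header row: ", headerRow)
```

The ternary falls back to `header=1` so the same code keeps working if the upstream script ever stops emitting the noise line.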
Since we're going to run a few processes over the same data, we'll initialize all of them simultaneously and fill in the details as we iterate through the data file.
visitorIDs = Set{String}()
busyHoursData = zeros(Int64, 24) # one bin per hour, 0-23
busyDaysData = zeros(Int64, 7)
totalDailyTutoringTime = zeros(Float64, 7)
reasonForVisit = Dict{String, Int}()
reasonForVisitClass = Dict{String, Int}()
for row in f
    studentID = row.FIT_ID
    if !ismissing(studentID)
        push!(visitorIDs, studentID)
    end
    timein = row.Time_In
    if ismissing(timein)
        continue
    end
    # Hours run 0-23, but Julia arrays are 1-indexed, so shift by one
    busyHoursData[Dates.hour(timein) + 1] += 1
    busyDaysData[Dates.dayofweek(timein)] += 1
    # Extract context histogram information; missing contexts count as "Unknown"
    rowContext = ismissing(row.Context) ? "Unknown" : row.Context
    reasonForVisit[rowContext] = get(reasonForVisit, rowContext, 0) + 1
    # Extract class histogram information; missing classes count as "Unknown"
    rowClass = ismissing(row.Class) ? "Unknown" : row.Class
    reasonForVisitClass[rowClass] = get(reasonForVisitClass, rowClass, 0) + 1
    timeout = row.Time_Out
    if ismissing(timeout)
        continue
    end
    timediff = timeout - timein
    totalDailyTutoringTime[Dates.dayofweek(timein)] += timediff.value / (1000 * 60) # milliseconds -> minutes
end
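Subtracting two `DateTime`s yields a `Millisecond` period, which is why the loop divides `.value` by `1000 * 60` to get minutes. A toy pair of timestamps (made up for illustration) shows the conversion:

```julia
using Dates

t_in  = DateTime(2024, 2, 5, 14, 30, 0)  # illustrative check-in
t_out = DateTime(2024, 2, 5, 15, 15, 0)  # illustrative check-out
d = t_out - t_in                         # a Millisecond period
minutes = d.value / (1000 * 60)          # -> 45.0
```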
Here's a brief summary of the activity in MAC during the given week:
using Printf
@printf "A total of %d unique visitors came into MAC this week, spending a collective %0.1f hours receiving services.\n" length(visitorIDs) sum(totalDailyTutoringTime)/60
Now, let's create some plots...
using Plots
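As a sketch of where this is headed, a bar chart of the hourly counts might look like the following. The variable `sampleHours`, its made-up values, the axis labels, and the output file name are all my own choices for illustration, standing in for the computed `busyHoursData`:

```julia
using Plots

# sampleHours stands in for busyHoursData (24 entries, hours 0-23);
# the values are fabricated for illustration only
sampleHours = [0,0,0,0,0,0,0,2,5,9,14,18,20,17,15,12,10,8,5,3,1,0,0,0]

p = bar(0:23, sampleHours;
        xlabel="Hour of day", ylabel="Swipe-ins",
        legend=false, xticks=0:2:22)
savefig(p, "busy-hours-sample.png")
```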